We need to be careful to balance the expressive power of the features against the computational cost of using them.
3 kinds of machine learning problems: supervised learning (learning from labeled examples), unsupervised learning (finding structure in unlabeled data), and reinforcement learning (learning actions that maximize a reward).
Here we will try to recreate the example given in the textbook.
The target $t$ is generated from $\sin(2\pi x)$ with additive Gaussian noise.
What we want to do is use this training data to predict the target value $\hat{t}$ for some new input value $\hat{x}$.
Here we use a polynomial, which is a nonlinear function of $x$ but a linear function of the coefficients $\boldsymbol{w}$. Functions that are linear in their unknown parameters have important properties and are called linear models.
$$ y(x,\boldsymbol{w}) = w_0 + w_1 x + w_2 x^2 + \dots + w_M x^M = \sum^{M}_{j=0} w_j x^j $$
We need an error function to measure how close the predicted values are to the given target values. Note that an error function is a function of the learned parameters, not of the inputs or outputs. An example is the sum-of-squares error (SSE): we want to find parameters $\boldsymbol{w}$ that minimize the error between the model's predictions on the given inputs and the target outputs.
$$ E(\boldsymbol{w}) = \frac{1}{2} \sum^{N}_{n=1}\big(y(x_n, \boldsymbol{w}) - t_n\big)^2 $$
Note: $E(\boldsymbol{w})$ is non-negative and convex.
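A minimal sketch of this error function in code (the helper name `sse` and the use of `np.polyval` are my own choices, not from the textbook):
In [ ]:
import numpy as np

def sse(w, x, t):
    """Sum-of-squares error E(w) for polynomial coefficients w = (w_0, ..., w_M)."""
    # np.polyval expects the highest-order coefficient first, so reverse w
    y = np.polyval(w[::-1], x)          # y(x, w) = sum_j w_j * x**j
    return 0.5 * np.sum((y - t) ** 2)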
In [1]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

# Generate noisy samples of the example function sin(2*pi*x)
num_points = 10
x = np.linspace(0., 1., num_points)
t = np.sin(2.0 * np.pi * x) + np.random.normal(0, 0.15, num_points)

# Dense grid for plotting the noise-free 'oracle' function
x_dense = np.linspace(0., 1., 100)
truth = np.sin(2.0 * np.pi * x_dense)

# Plot the true function (green) and the noisy observations (red)
plt.plot(x_dense, truth, 'g')
plt.plot(x, t, 'ro')
plt.title('Example 1.1');
We can solve for the optimal $\boldsymbol{w}$ by setting the derivative of the error function to zero. Since $E(\boldsymbol{w})$ is quadratic in $\boldsymbol{w}$, its derivatives are linear in $\boldsymbol{w}$, and so minimizing the error has a unique closed-form solution.
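A minimal sketch of that closed-form fit on the data generated above (the order `M = 3` here is an illustrative choice):
In [ ]:
# Fit an M-th order polynomial in closed form: the design matrix Phi has
# entries Phi[n, j] = x_n ** j, and w* solves min_w ||Phi w - t||^2
M = 3
Phi = np.vander(x, M + 1, increasing=True)
w_star, *_ = np.linalg.lstsq(Phi, t, rcond=None)

# Plot the fitted polynomial (blue) against the truth and the noisy data
plt.plot(x_dense, np.vander(x_dense, M + 1, increasing=True) @ w_star, 'b')
plt.plot(x_dense, truth, 'g')
plt.plot(x, t, 'ro');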
Note that there remains the problem of picking the order $M$. This is a process called model selection, since changing $M$ changes the family of functions used to approximate the unknown 'oracle' function.
RMS Error
$$ E_{\mathrm{RMS}} = \sqrt{2E(\boldsymbol{w}^\star)/N} $$
Dividing by $N$ lets us compare datasets of different sizes on an equal footing, and the square root puts the error on the same scale as the target $t$. For practical applications we need a way to find a suitable value for the model complexity. A simple method is to hold back some of the training data as a validation set and use it to fine-tune the model complexity.
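A sketch of this hold-out idea, drawing a hypothetical validation set from the same noisy $\sin(2\pi x)$ process and comparing $E_{\mathrm{RMS}}$ across orders:
In [ ]:
# Hold-out validation: train on (x, t), evaluate RMS error on fresh data
x_val = np.random.uniform(0., 1., num_points)
t_val = np.sin(2.0 * np.pi * x_val) + np.random.normal(0, 0.15, num_points)

def rms_error(w, x_data, t_data):
    y = np.vander(x_data, len(w), increasing=True) @ w
    return np.sqrt(np.mean((y - t_data) ** 2))    # sqrt(2 E(w) / N)

# Training error keeps falling as M grows; validation error turns back up
# once the model starts fitting the noise (overfitting)
for M in [0, 1, 3, 9]:
    Phi = np.vander(x, M + 1, increasing=True)
    w_star, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    print(M, rms_error(w_star, x, t), rms_error(w_star, x_val, t_val))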
One technique often used to control overfitting is regularization, which adds a penalty term to the error function to discourage the coefficients from growing large, i.e.
$$ \tilde{E}(\boldsymbol{w}) = \frac{1}{2}\sum^{N}_{n=1}\{y(x_n, \boldsymbol{w}) - t_n\}^2 + \frac{\lambda}{2}\lVert\boldsymbol{w}\rVert^2 $$
where $\lVert\boldsymbol{w}\rVert^2 = \boldsymbol{w}^T\boldsymbol{w}$ and the coefficient $\lambda$ governs the relative importance of the penalty term.
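Minimizing $\tilde{E}$ also has a closed-form solution, $\boldsymbol{w} = (\Phi^T\Phi + \lambda I)^{-1}\Phi^T\boldsymbol{t}$. A sketch with $M = 9$ and an illustrative $\lambda$ (the textbook uses $\ln\lambda = -18$ for this case):
In [ ]:
# Regularized least squares: setting the gradient of the penalized error
# to zero gives (Phi^T Phi + lambda * I) w = Phi^T t
M, lam = 9, np.exp(-18)
Phi = np.vander(x, M + 1, increasing=True)
w_reg = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M + 1), Phi.T @ t)

plt.plot(x_dense, np.vander(x_dense, M + 1, increasing=True) @ w_reg, 'b')
plt.plot(x_dense, truth, 'g')
plt.plot(x, t, 'ro');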
Probability theory provides a framework for quantifying uncertainty; combined with decision theory, it lets us make optimal predictions from uncertain data.
Note: by definition, probabilities must lie in the interval $[0, 1]$.
Product Rule: expresses the joint probability of two events (which need not be independent) as a conditional probability multiplied by a marginal probability $$ p(X=x_i, Y=y_j) = p(Y=y_j|X=x_i)p(X=x_i) $$
Sum Rule: the marginal probability is obtained by summing the joint probability over the other variable $$ p(X=x_i) = \sum_{j} p(X=x_i, Y=y_j) $$
Symmetry Property $$ p(X,Y) = p(Y,X) $$
Bayes' Theorem: equation to calculate the posterior probability of $Y$ given the observations $X$. $$ p(Y|X) = \frac{p(X|Y)p(Y)}{p(X)} $$
Prior: probability available before we make observations, $p(Y)$
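A small numeric sketch of Bayes' theorem, using the textbook's fruit-box numbers (red box: 2 apples, 6 oranges; blue box: 3 apples, 1 orange; the red box is picked with probability 0.4):
In [ ]:
p_box = {'r': 0.4, 'b': 0.6}                   # prior p(B)
p_orange = {'r': 6 / 8, 'b': 1 / 4}            # likelihood p(F = orange | B)

# Evidence p(F = orange) via the sum and product rules
evidence = sum(p_orange[b] * p_box[b] for b in p_box)

# Posterior p(B = r | F = orange) via Bayes' theorem
print(p_orange['r'] * p_box['r'] / evidence)   # 2/3: seeing an orange raises
                                               # the probability of the red box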
Cumulative Distribution Function: probability that $x$ lies in the interval $(-\infty, z)$ $$ P(z) = \int^{z}_{-\infty}p(x)dx $$
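A sketch of this definition, approximating the CDF of a standard Gaussian by numerically integrating its density up to $z$ (truncating the lower limit at $-8$ as a stand-in for $-\infty$):
In [ ]:
# Approximate P(z) = integral of p(x) over (-inf, z] with the trapezoidal rule
def gauss_pdf(u):
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

z = 1.0
grid = np.linspace(-8.0, z, 10001)
print(np.trapz(gauss_pdf(grid), grid))   # ~0.8413, the standard normal CDF at 1.0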
For multivariate probabilities, where $\boldsymbol{x} = (x_1, x_2, \dots, x_n)$, the density becomes a joint density over the vector, $p(\boldsymbol{x}) = p(x_1, x_2, \dots, x_n)$, and the sum and product rules above must still hold.
In the discrete case, $p(x)$ is called a probability mass function.
Note also that probability densities obey the same product and sum rules.
Bayesian vs. frequentist
Bayesians hold that there is only the single dataset actually observed, and uncertainty in the parameters is expressed through a probability distribution over them.
Frequentists view probability as the frequency of repeated events; model parameters are considered fixed, and it is the data that is treated as random.